{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from IPython.core.display import Image, display"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Variance Analysis in Multistep Bootstrapping\n",
    "\n",
    "In order to understand the $Q(\\sigma)$ algorithm better, we will analyze and compare its performance to other multistep bootstrapping methods. Since $Q(\\sigma)$ is tied to the off policy n-step tree backup algorithm, we thus compare it to the off-policy variants of n-step SARSA, n-step expected SARSA, and to the n-step tree backup algorithm.\n",
    "\n",
    "#### About bootstrapping multistep methods:\n",
    "As [1] notes, bootstrapping methods work best when applied over multiple time steps, such that a significant and recognizable state change can be observed. From [1] also, note that multistep methods are usually associated with eligibility traces, but in this benchmark we consider only the multistep component of these algorithms.\n",
    "\n",
    "## Algorithmic Theoretical Background\n",
    "### Off-policy N-step SARSA and N-step expected SARSA [1]\n",
    "The regular SARSA(0) algorithm learns the value of state-action pairs and, after each transition from a nonterminal state $S_t$ applies the update equation:\n",
    "$$ Q(S,A) \\leftarrow Q(S,A) + \\alpha [ R + \\gamma Q(S', A') - Q(S,A)]$$\n",
    "\n",
    "In off-policy n-step SARSA, we still learn the value of state-action pairs but only update our estimate after $n$ steps, weighing the update using the importance sampling ratio $\\rho_{t}^{t+n}$ defined as the relative probability under the two policies of taking the n actions from $A_t$ to $A_{t+n-1}$:\n",
    "$$\\rho_t^{t+n} = \\prod_{k=t}^{min(t+n-1, T-1)}  \\frac{\\pi(A_k|S_k)}{\\mu(A_k|S_k)}$$\n",
    "\n",
    "And we use the update equation:\n",
    "\n",
    "$$ Q_{t+n}(S_t,A_t) = Q_{t+n-1}(S_t,A_t) + \\alpha \\rho_{t+1}^{t+n} [ G_t^{(n)} - Q_{t+n-1}(S_t,A_t)] $$\n",
    "\n",
    "with $G_t^{(n)}$ the n-step return defined in terms of estimated action values as:  \n",
    "\n",
    "$$ G_t^{(n)} = R_{t+1} + \\gamma R_{t+2} \\cdots + \\gamma^{n-1} R_{t+n} + \\gamma^n Q_{t+n-1}(S_{t+n},a) , n \\geq 1, 0\\leq t \\leq T-n$$\n",
    "\n",
    "\n",
    "For off-policy n-step expected SARSA, we simply change the computation of the n-step return to:\n",
    "$$ G_t^{(n)} = R_{t+1} + \\cdots + \\gamma^{n-1} R_{t+n} + \\gamma^n \\sum_a \\pi(a|S_{t+n}) Q_{t+n-1}(S_{t+n},a) , n \\geq 1, 0\\leq t \\leq T-n$$ \n",
    "\n",
    "### N-step Tree Backup Algorithm [1]\n",
    "In the n-step tree backup algorithm, we consider all possible actions from each state and estimate their values at step $t$. However we do not boostrap for the action that is actually taken at time $t$ as we therefore have a sample for this action. We thus go on to consider the state and actions that follow from having taken that action at time $t$, until $n$ steps have been exhausted.\n",
    "\n",
    "Note that each of the action values of the possible actions at a specific state gets weighted by its probability of being taken, with the taken action weighing the whole tree underneath it (= all the possible actions originating from having taken that action). Therefore all the actions underneath a taken action $a$ at step $t$ will have their probabilities weighted by the probability of taking $a$ at step $t$.\n",
    "\n",
    "We thus obtain the n-step return used in tree-backup:\n",
    "$$G_t^{(n)} = Q_{t-1}(S_t, A_t) + \\sum_{k=t}^{min(t+n-1, T-1)}\\delta_k \\prod_{i=t+1}^{k} \\gamma \\pi(A_i | S_i)$$\n",
    "\n",
    "where $$\\delta_t = R_{t+1} + \\gamma V_{t+1} - Q_{t-1}(S_t, A_t)$$\n",
    "\n",
    "This return is then used in the action-value update equation (same as n-step SARSA, without importance sampling): \n",
    "$$ Q_{t+n}(S_t,A_t) = Q_{t+n-1}(S_t,A_t) + \\alpha [ G_t^{(n)} - Q_{t+n-1}(S_t,A_t)] $$\n",
    "\n",
    "### N-Step $Q(\\sigma)$ algorithm [1]\n",
    "In the n-step $Q(\\sigma)$ algorithm, we alternate between the ideas from Sarsa and the tree backup algorithm. Thus at each step, based on the value of $\\sigma \\in [0,1]$, the $Q(\\sigma)$ update takes a fraction of the sample-based update (Sarsa) and the rest from the non sampling update (tree backup).\n",
    "[1] thus generalizes the TD error $\\delta_t$ to be a function of $\\sigma$:\n",
    "$$\\delta_t = R_{t+1} + \\gamma[\\sigma_{t+1}Q_{t}(S_{t+1}, A_{t+1}) + (1-\\sigma_{t+1}) V_{t+1}] - Q_{t-1}(S_t, A_t)$$\n",
    "\n",
    "which is then used in the n-step return equation :\n",
    "$$G_t^{(n)} = Q_{t-1}(S_t, A_t) + \\sum_{k=t}^{min(t+n-1, T-1)}\\delta_k \\prod_{i=t+1}^{k} \\gamma [(1-\\sigma_i)\\pi(A_i | S_i) + \\sigma_i]$$\n",
    "\n",
    "Finally we update the importance sampling ratio to take $\\sigma$ into account:\n",
    "$$\\rho_t^{t+n} = \\prod_{k=t}^{min(t+n-1, T-1)} (\\sigma_k \\frac{\\pi(A_k|S_k)}{\\mu(A_k|S_k)} + 1 - \\sigma_k)$$\n",
    "\n",
    "We can then use the above equations in the off-policy update equation from off-policy Sarsa."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Graphical Representation\n",
    "The backup diagrams for each of these algorithms is thus as shown in Figure 7.5 of [1]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"images/backup_tree.png\" width=\"600\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Image(url= \"images/backup_tree.png\",width=600)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Brief Variance Theoretical Background\n",
    "\n",
    "## Off-Policy vs. On-Policy\n",
    "\n",
    "We begin with analyzing the variance in off-policy methods vs. on-policy methods. From [1], we know that for off-policy learning, importance sampling \"enables off-policy learning, but at the cost of increasing the variance of the updates.\" While weighted importance sampling [2] or adapative importance sampling (based on previous timestep variance) [3,6] can bring down this variance, typical n-step off-policy SARSA does not use this variance reduction technique as described in [1].\n",
    "\n",
    "### Getting Rid of Importance Sampling (n-step Tree Backup)\n",
    "\n",
    "In [4], where n-step Tree Backup is first introduced (without importance sampling in Section 4 of [4]), the authors bypass importance sampling by using backups from possible actions at each step combined with the probability of the target policy taking this action. This removes importance sampling and theoretically the variance introduced by it. Intuitively, this is the same bias-variance tradeoff as in Expected Sarsa. By using the expectation at each timestep we are bounding the variance (see [5] for more details on bias-variance tradeoff of Sarsa vs. Expected SARSA). \n",
    "\n",
    "### Combining Importance Sampling with Expecation in Tree Backup (i.e. n-step $Q(\\sigma)$)\n",
    "\n",
    "By adding function which balances sampling with the expectation backups of n-step tree backup, we can balance the variance. One can think of an easy extension such that the $\\sigma$ provided to the algorithm is a function of the previous step's variance (similar to adaptive importance sampling or autostep [3,6]). However, say that we simply choose the $\\sigma$ value to be a uniformly distributed random variable, we would expect that the variance would be about halfway between tree backup and n-step expected sarsa on average as it uses importance sampling for half of the time and the tree backup update half of the time.\n",
    "\n",
    "## Partial Sketch of Variance Analyses\n",
    "\n",
    "Note: We think our analysis here is outside the scope of the assignment, but we include some of our partial derivations regardless for those that might be interested in completing them with us.\n",
    "\n",
    "As in [5], we formulate the variance of n-step Sarsa as follows. Take $v_t = \\rho \\left[ G_t^{(n)} - Q_{t+n-1}(S_t, A_t) \\right]$\n",
    "\n",
    "We can then expand the variance term as follows:\n",
    "\n",
    "$$Var(s,a) = E \\left\\{ (v_t)^2 \\right\\} - (E \\left\\{ v_t \\right\\} )^2$$\n",
    "\n",
    "Ignoring the second term for now:\n",
    "\n",
    "$$Var(s,a) = E \\left\\{ \\rho^2 G_t^{(n)} + Q_{t+n-1}^2(s_t, a_t) - 2 \\rho G_t^{(n)} Q_{t+n-1}(s_t, a_t) \\right\\}$$\n",
    "\n",
    "Here, $G_t^{(n)} = \\sum_{k=1}^n \\gamma^{k-1} R_{t+k} + \\gamma^n \\sum_a \\pi(a | s_{t+n}) Q_{t+n-1}(s_t, a)$ \n",
    "\n",
    "Similarly, for n-step tree backup: $v_t = \\left[ \\hat{G}_t^{(n)} - Q_{t+n-1}(S_t, A_t) \\right]$, where $$\\hat{G}_t^{(n)} = Q_{t-1}(S_t, A_t) + \\sum_{k=t}^{min(t+n-1, T-1)}\\delta_k \\prod_{i=t+1}^{k} \\gamma \\pi(A_i | S_i)$$\n",
    "and $$\\delta_t = R_{t+1} + \\gamma V_{t+1} - Q_{t-1}(S_t, A_t)$$ and $V_{t+1} = \\sum_{a} \\pi(a|S_{t+1}) Q(S_{t+1}, a)$\n",
    "\n",
    "Thus, for n-strep tree backup the variance becomes partially (ignoring the $(E \\left\\{ v_t \\right\\} )^2$ term for now:\n",
    "\n",
    "$$Var(s,a) = E \\left\\{\\hat{G}_t^{(n)} + Q_{t+n-1}^2(s_t, a_t) - 2 \\hat{G}_t^{(n)} Q_{t+n-1}(s_t, a_t) \\right \\}$$\n",
    "\n",
    "Due to time we don't continue the derivations, but partially, we know that a main difference between the two terms is the importance sampling. We suggest that by expansion (reflecting what emperical evidence suggests in [2,4]), we expect that Sarsa will have higher variance when the $\\rho$ term is larger (i.e. when importance sampling plays a larger role in the terms the difference in variance between tree backup, where there is no importance sampling, and n-step sarsa, where there is, will be larger).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experimental setup\n",
    "For our experimental environments we use a deterministic gridworld and a stochastic gridworld.\n",
    "\n",
    "### Stochastic gridworld\n",
    "In the stochastic version of our GridWorld problem, every action you take has a probability $p$ of going into the direction you want and probability $\\frac{1-p}{3}$ of going in each of the other directions. Otherwise it is identical to the deterministic GridWorld problem.\n",
    "\n",
    "### Deterministic gridworld\n",
    "In this version of gridworld, any action that is attempted is successful with 100% probability. For our experiments, we set up a grid world MDP similar to the one shown on Sutton page 82 (shown below), of size 5x5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://webdocs.cs.ualberta.ca/~sutton/book/ebook/figtmp15.png\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Image(url= \"https://webdocs.cs.ualberta.ca/~sutton/book/ebook/figtmp15.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Implementation details\n",
    "For all our algorithms, we have used a discount factor $\\gamma = 0.9$.  \n",
    "In our off-policy algorithms which use importance sampling, we've chosen the target policy to be $\\epsilon$-greedy with $\\epsilon = 0.1$, and the behavior policy to be $\\epsilon$-greedy with $\\epsilon = 0.3$.  \n",
    "Also note that in our implementation of the $Q(\\sigma)$ algorithm, $\\sigma$ is sampled from a uniform distribution in $[0,1]$.\n",
    "\n",
    "## Experimental results\n",
    "\n",
    "Here we first examine the effect of varying $\\alpha$, the learning rate on the average reward, the variance in average Q values, and the variance in obtained reward. As we've seen previously (see [Analysis of the bias variance tradeoff in SARSA variants](https://github.com/Breakend/SarsaVsExpectedSarsa/blob/master/BiasVarianceTradeoff.ipynb)), most algorithms suffer from a higher variance in average Q values, leading to a lower average reward, as the learning rate increases.  \n",
    "Again we expect here to see n-step expected Sarsa having lower Q variance and higher average reward than n-step sarsa. We also expect the n-step tree backup algorithm to have a lower Q variance than all other algorithms as it doesn't use importance sampling. And, we expect the $Q(\\sigma)$ algorithm to have a Q variance that is in-between the Q variance obtained for n-step expected Sarsa and tree backup as it uses a uniform random variable to balance the two.  \n",
    "\n",
    "For every experiment, we run all algorithms for a maximum of 10,000 episodes and average that across 5 trials. \n",
    "\n",
    "### Deterministic gridworld\n",
    "We start by running each of the algorithms on the 5x5 deterministic grid world, with a single small reward on the bottom right of the grid, equal to 1. We can see that the variance climbs up for tree backup and causes the average reward to go down at higher alpha values. This is not the case with the n-step Sarsa methods. We suspect that this relates to n-step Sarsa and n-step Expected Sarsa relying solely on the variance of the policy and environment, while tree backup and $Q(\\sigma)$ rely on other factors in their update function which can cause these variance fluctuations.\n",
    "\n",
    "##### Average Q Variance\n",
    "\n",
    "![alt](images/q_variance_vs_alpha_deterministic.png) \n",
    "\n",
    "##### Average Reward\n",
    "\n",
    "![alt](images/average_reward_vs_alpha_deterministic.png) \n",
    "\n",
    "##### Reward Variance\n",
    "\n",
    "![alt](images/variance_reward_vs_alpha_deterministic.png) \n",
    "\n",
    "\n",
    "Now we run each of the algorithms again on a 5x5 deterministic grid world while varying the number of steps n. However this time around, to analyze the effect of a higher reward variance in the environment we instead use a reward equal to 10 on the bottom right.  \n",
    "As we can observe in the graphs below, as expected the average reward increases while the variance in reward and the average Q variance decreases as the number of steps $n$ increases.  \n",
    "However we can observe here that the $Q(\\sigma)$ algorithm, unlike in the previous experiment, consistently obtains a higher reward than the other algorithms.\n",
    "\n",
    "##### Average Q Variance\n",
    "\n",
    "![alt](images/q_variance_vs_nstep_deterministic.png) \n",
    "\n",
    "##### Average Reward\n",
    "\n",
    "![alt](images/average_reward_vs_nstep_deterministic.png) \n",
    "\n",
    "##### Reward Variance\n",
    "\n",
    "![alt](images/variance_reward_vs_nstep_deterministic.png) \n",
    "\n",
    "\n",
    "\n",
    "### Stochastic gridworld\n",
    "\n",
    "We now run each of the algorithms on the 5x5 stochastic grid world with the same small reward on the bottom right of the grid and observe the effects of varying the learning rate $\\alpha$.\n",
    "\n",
    "##### Average Q variance\n",
    "![alt](images/q_variance_vs_alpha_stochastic.png) \n",
    "As we can see in the figure above, the variance in the average Q value increases as expected for all algorithms. \n",
    "The average Q variance is highest and increases the most for n-step Sarsa as expected, while the other algorithms do not suffer as much from increasing the $\\alpha$ value.  \n",
    "Confirming our hypothesis, the n-step expected Sarsa algorithm obtains a lower average Q variance compared to Sarsa and is bounded below by n-step $Q(\\sigma)$, which itself is bounded below by the lowest average Q variance algorithm, n-step tree backup.\n",
    "\n",
    "##### Average reward\n",
    "![alt](images/average_reward_vs_alpha_stochastic.png) \n",
    "\n",
    "Reflecting our results for the average Q variance, we observe in the figure above a decay in average reward as we increase the learning rate for all algorithms, and the highest average reward is obtained by the n-step tree backup algorithm, followed by n-step $Q(\\sigma)$, n-step expected Sarsa, and finally, n-step Sarsa.\n",
    "\n",
    "##### Reward variance\n",
    "\n",
    "![alt](images/variance_reward_vs_alpha_stochastic.png) \n",
    "\n",
    "Once again the results above reflect the lower variance and higher reward of the n-step tree backup algorithm compared to the others which find themselves in the same order again.\n",
    "\n",
    "\n",
    "Now we run the same experiment while varying the number of steps $n$, once again with a higher reward equal to 10 on the bottom right of the grid.  \n",
    "As we can see here, once again $Q(\\sigma)$ outperforms all methods for all n steps in terms of average reward. We again believe this to be a function of the reward setup in this experiment having greater variance (a +10 reward on the goal vs -1 on each transition). Tree backup is however the least affected by the number of steps in both variance and reward.  \n",
    "And once again we can observe that overall, however, more steps decreases the variance and increases the reward until a certain point.\n",
    "\n",
    "##### Average Q Variance\n",
    "\n",
    "![alt](images/q_variance_vs_nstep_stochastic.png) \n",
    "\n",
    "##### Average Reward\n",
    "\n",
    "![alt](images/average_reward_vs_nstep_stochastic.png) \n",
    "\n",
    "##### Reward Variance\n",
    "\n",
    "![alt](images/variance_reward_vs_nstep_stochastic.png) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experimenting with $Q(\\sigma)$\n",
    "One main strengths of $Q(\\sigma)$ is being able to balance using importance sampling vs tree backup updates. We have seen that with a uniformly distributed sigma where the reward variance is not high, Q sigma performs as expected slightly below tree backup. However, if the reward variance is higher, it performs not as well.  \n",
    "Thus, $Q(\\sigma)$ can be tuned with custom sigma functions to attempt to improve performance for different environments by controlling the source of the variance.\n",
    "\n",
    "We experiment with our own custom function as a step in this direction. We take the variance of the Q function over the last n steps. If the latest Q function variance is greater than the average of the last n variances, we set sigma to 0 (thereby relying on the tree backup update). In this way we try to control the variance in different environments and flip solely to the tree backup model. While we don't have the best results, we can see below that this in fact does improve on the above average reward at higher alphas in a Stochastic GridWorld model.\n",
    "\n",
    "![alt](images/average_reward_with_custom_sigma.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Practical Lessons\n",
    "\n",
    "## Computational Efficiency (n-step methods aren't great)\n",
    "\n",
    "Overall, we found n-step methods to be far more computationally intensive non-n-step variations on Sarsa. Some statistics on runtimes for 1000 episodes on a 5x5 Deterministic Gridworld on machine with Intel i5 processor (6600):\n",
    "\n",
    "+ N-step Sarsa (average runtime): 3.7 min\n",
    "+ N-Step Expected Sarsa (average runtime): 5.15 min\n",
    "+ N-Step Tree backup (average runtime): 5.35 min\n",
    "+ N-Step Q-Sigma (average runtime): 8.3 min\n",
    "+ Sarsa (average runtime): 29 seconds\n",
    "+ Expected Sarsa (average runtime): 34 seconds\n",
    "+ Double Expected Sarsa (average runtime): 35 seconds\n",
    "+ Double Sarsa (average runtime): 32 seconds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Citations and Footnotes\n",
    "[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998.\n",
    "\n",
    "[2] Precup, Doina, Richard S. Sutton, and Sanjoy Dasgupta. \"Off-policy temporal-difference learning with function approximation.\" ICML. 2001.\n",
    "\n",
    "[3] Hachiya, Hirotaka, et al. \"Adaptive importance sampling for value function approximation in off-policy reinforcement learning.\" Neural Networks 22.10 (2009): 1399-1410.\n",
    "\n",
    "[4] Precup, Doina. \"Eligibility traces for off-policy policy evaluation.\" Computer Science Department Faculty Publication Series (2000): 80.\n",
    "\n",
    "[5] Van Seijen, Harm, et al. \"A theoretical and empirical analysis of Expected Sarsa.\" Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL'09. IEEE Symposium on. IEEE, 2009.\n",
    "\n",
    "[6] Mahmood, Ashique Rupam, et al. \"Tuning-free step-size adaptation.\" Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
